Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change TerminatePhase return value #6335

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

knelli2
Copy link
Contributor

@knelli2 knelli2 commented Oct 16, 2024

Now TerminatePhase returns Parallel::AlgorithmExecution::Halt. Also added a new action PausePhase which returns
Parallel::AlgorithmExecution::Pause if that is needed.

This changes how the algorithm ends at the end of a phase.

Proposed changes

This is a fix for a bug in nodegroup parallization, but also a good thing to do. Once we get to the end of an PDAL (and we aren't looping), we shouldn't be allowed to restart as many phases are only intended to run once

Upgrade instructions

Code review checklist

  • The code is documented and the documentation renders correctly. Run
    make doc to generate the documentation locally into BUILD_DIR/docs/html.
    Then open index.html.
  • The code follows the stylistic and code quality guidelines listed in the
    code review guide.
  • The PR lists upgrade instructions and is labeled bugfix or
    new feature if appropriate.

Further comments

Now TerminatePhase returns Parallel::AlgorithmExecution::Halt.
Also added a new action PausePhase which returns
Parallel::AlgorithmExecution::Pause if that is needed.

This changes how the algorithm ends at the end of a phase.
@nilsvu nilsvu requested a review from nilsdeppe October 16, 2024 17:38
Copy link
Member

@wthrowe wthrowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a full review, but I think you need to respond to @kidder's concerns from elsewhere, reproduced below:

  • Retry tells the distributed object to wait for more data before resuming the algorithm. The algorithm is still considered to be running on the distributed object. A subsequent call to receive_data will attempt to run the same action after putting data into the Inbox
  • Halt stops the algorithm. Nothing can restart the algorithm of a halted distributed object until the new phase is determined. This is used to execute a triggered phase change such as checkpointing or AMR
  • Pause stops the algorithm, but lets either something calling receive_data(instance, data, true) or perform_algorithm(true) (where the trues override defaulted false arguments) on the paused distributed object to restart the algorithm.
  • The member variable terminate_ is true if the algorithm is stopped. This happens for either Halt or Pause being called
  • The member variable halt_algorithm_until_next_phase_ is set true when Halt is called
  • There are currently two instances of perform_algorithm(true), one is in an example Geoffrey uses in his tutorial, the other is in the linear solver action BuildMatrix
  • So calling perform_algorithm() on a Paused or Halted object should not have done anything. So I am confused by the [diagnosis of a restart being caused by a reduction]

Copy link
Member

@wthrowe wthrowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of code changes. Still need to be convinced this isn't just hiding some other bug.

step_actions>>;
// This test doesn't operate exactly like how SelfStart would normally work in
// an executable. Instead it jumps around quite a lot. Therefore to avoid any
// issues that TerminatePhase would cause, we just replace it with PausePhase.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove it. No reason to replace it with a no-op.

@@ -209,10 +209,10 @@ SPECTRE_TEST_CASE(
std::make_unique<::TimeSteppers::DormandPrince5>()));

// Run the initializations
for (size_t i = 0; i < 6; ++i) {
for (size_t i = 0; i < 5; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a ton of unexplained changes like this in this PR. What is going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 5 actions in the initialization phase for this component. Previously the last one returned Pause, so when it called ActionTesting::next_action one extra time, nothing bad happened. But now that the last action returns Halt, calling it an extra time causes an error. So it was a bug previously to call it 6 times. Same for the rest of the changes like this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants